Unsupervised Learning: Insurance Customer Segmentation through Cluster Analysis for Marketing and Pricing Decisions

Contributor: Jason Khoo

Introduction

The business model of insurance companies revolves around the assumption and diversification of risk. One key revenue-generating activity of insurance companies is charging premiums in exchange for insurance coverage. Through their underwriting processes, insurance companies have amassed a large amount of data on their policyholders, including demographic, health and property data.

Advanced data analytics, such as clustering techniques, give the industry an opportunity to tap into this accumulated data to strengthen premium pricing strategies, conduct targeted marketing campaigns and formulate distribution strategies.

Problem Statement

In this assignment, I will perform a cluster analysis using the K-means and K-medoids algorithms to segment customers based on health insurance policyholder data, in support of future business decisions.

Dataset

Executive Summary

A. Set Up

B. Data Wrangling

C. Feature Selection

D. Feature Scaling and Principal Component Analysis (PCA)

E. K-means Clustering Implementation

F. K-medoids Clustering Implementation

G. Conclusion

A. Set Up

i. Import Python packages

ii. Dataset loading

iii. Display Columns

iv. Predictors (Inputs) Selection

In this section, I will consider the following as relevant independent variables for my clustering models:

Charges (numerical) will not be used in my clustering models. However, it will be used as guidance for my insurance premium pricing decisions later on.

B. Data Wrangling

i. Recoding of Categorical Data
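
As an illustration of this step, here is a minimal sketch of recoding categorical columns with pandas. The sample rows, the category values and the `_code` column names are assumptions made for the example and are not taken from the original dataset.

```python
import pandas as pd

# Hypothetical sample rows; values and column names are assumptions.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Binary categories map directly onto 0/1.
df["sex_code"] = df["sex"].map({"female": 0, "male": 1})
df["smoker_code"] = df["smoker"].map({"no": 0, "yes": 1})

# A multi-level category gets integer codes, assigned in alphabetical order
# of its levels (northeast=0, northwest=1, southeast=2, southwest=3).
df["region_code"] = df["region"].astype("category").cat.codes

print(df)
```

Integer codes impose an artificial ordering on region, which distance-based clustering will treat as meaningful. One-hot encoding (for example with `pd.get_dummies`) is a common alternative when that ordering is undesirable.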

C. Feature Selection

Exploratory Data Analysis and Visualisation

i. Feature Selection using Correlation Matrix
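
A correlation matrix for this kind of feature screening could be computed along the following lines. The data here is synthetic and generated only for the example (with charges deliberately built to depend on age), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data, an assumption for illustration only.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
children = rng.integers(0, 5, size=n)
# Charges are constructed to rise with age, plus noise.
charges = 250 * age + rng.normal(0, 2000, size=n)

df = pd.DataFrame({"age": age, "bmi": bmi, "children": children, "charges": charges})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

Features that correlate strongly with each other carry overlapping information, which is one reason to inspect this matrix before clustering.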

ii. General Exploration

Age:

BMI:

Children:

Sex:

Smoker:

Region:

Conclusion:

D. Feature Scaling and Principal Component Analysis (PCA)

i. Feature Scaling via Standardisation
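
Standardisation rescales each feature to zero mean and unit variance so that no single feature dominates the distance calculations. A minimal sketch using scikit-learn's `StandardScaler` follows; the four sample rows (age, BMI, children) are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: age, BMI, number of children.
X = np.array([
    [19, 27.9, 0],
    [33, 22.7, 0],
    [46, 33.4, 1],
    [62, 26.3, 3],
], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and (population) standard deviation 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```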

ii. PCA Implementation
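
The sketch below shows one way PCA could be applied to the standardised features. The six-column synthetic matrix and the choice of two components are assumptions for the example; two components are convenient because they allow the clusters to be plotted on a pair of PCA axes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for six policyholder features (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# PCA is sensitive to scale, so standardise first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Share of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much information is lost by projecting onto the retained components.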

E. K-means Clustering Implementation

i. Identifying the number of clusters to use

Elbow Plot
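
The elbow method fits K-means for a range of cluster counts and looks for the point where the within-cluster sum of squares (`inertia_` in scikit-learn) stops falling sharply. The sketch below computes those values on synthetic blobs with three known centres, an assumption made so that the elbow is easy to see. Plotting `inertias` against `k` gives the elbow plot.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated centres (an assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=1.0,
    random_state=0,
)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid.
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```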

Silhouette Score Method

(Reference: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html )
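
The silhouette score measures how close each point is to its own cluster compared with the nearest other cluster, and ranges from -1 to 1. A simple way to use it, sketched here on synthetic blobs (an assumption for the example), is to compute the mean score for several values of `k` and prefer the highest.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated centres (an assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Highest mean silhouette score at k =", best_k)
```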

ii. Clustering Implementation
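
A minimal sketch of fitting K-means with three clusters and attaching the labels back to the data is shown below. The synthetic policyholder table is an assumption; the `Clusters` column name follows the labels used in this section.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic policyholder features (an assumption for illustration).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=150),
    "bmi": rng.normal(30, 6, size=150).round(1),
    "children": rng.integers(0, 5, size=150),
})

# Cluster on standardised features, then store the label per policyholder.
X_scaled = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Clusters"] = kmeans.fit_predict(X_scaled)

# Profile each cluster on the original (unscaled) axes.
print(df.groupby("Clusters").mean().round(1))
```

Profiling on the original axes, rather than the scaled ones, keeps the cluster descriptions in units a business reader can interpret.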

Visualisation using PCA axes

Visualisation using original data axes

1. Cluster 1 (Clusters = 2, yellow data points)

2. Cluster 2 (Clusters = 0, blue data points)

3. Cluster 3 (Clusters = 1, pink data points)

The sex, smoker and region variables do not appear to have much influence on the clusters, since all clusters show similar percentages for each attribute of these 3 variables.

F. K-medoids Clustering Implementation

i. Clustering Implementation

(Reference: https://scikit-learn-extra.readthedocs.io/en/stable/auto_examples/plot_kmedoids.html#sphx-glr-auto-examples-plot-kmedoids-py)
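
To show the idea behind K-medoids without depending on a particular library, here is a plain NumPy sketch of the alternating variant of the algorithm. It is not the scikit-learn-extra implementation referenced above. The function name `k_medoids`, the farthest-first initialisation and the synthetic data are all assumptions for the example. Unlike K-means, each cluster centre is an actual data point, which makes the result less sensitive to outliers.

```python
import numpy as np

def k_medoids(X, k, n_iter=100):
    """Alternating k-medoids: assign to nearest medoid, then re-pick medoids."""
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Farthest-first initialisation (an illustrative choice): start with the
    # most central point, then add the point farthest from the chosen medoids.
    chosen = [int(dist.sum(axis=1).argmin())]
    while len(chosen) < k:
        chosen.append(int(dist[:, chosen].min(axis=1).argmax()))
    medoids = np.array(chosen)

    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids

    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Synthetic data: three groups of 30 points around distant centres.
rng = np.random.default_rng(0)
centers = np.array([[-10.0, -10.0], [0.0, 10.0], [10.0, -10.0]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])

medoids, labels = k_medoids(X, k=3)
print("Medoid rows:", medoids)
print("Cluster sizes:", np.bincount(labels))
```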

Visualisation using PCA axes

Visualisation using original data axes

1. Cluster 1 (Clusters = 0, blue data points)

2. Cluster 2 (Clusters = 1, pink data points)

3. Cluster 3 (Clusters = 2, yellow data points)

The smoker, children and age variables do not appear to have much influence on the clusters, since all clusters show similar percentages for each attribute of these variables.

ii. Evaluation on the number of clusters used

Elbow Plot

Silhouette Score Method

G. Conclusion

Conclusion

The clustering results can guide insurers in making useful business decisions based on the insights generated.

Using the cluster analysis results, the insurer might segment customers into the relevant clusters and design targeted marketing campaigns or pricing decisions around the characteristics exhibited by the customers in each cluster, with the aim of increasing revenue.

- K-means clustering

Here are some of the marketing decisions that could be implemented based on the cluster analysis using K-means:

Cluster 1: Insurance products that can be considered for cross-selling include mortality protection and disability insurance cover. Policyholders in this cluster have large families, and this cover would help ensure that their families' financial needs are still met after an unexpected incident.

Cluster 2: Given the older age group of this cluster, the insurer can consider up-selling existing health insurance and critical illness plans, which help older policyholders mitigate the cost of medical treatment in old age.

Cluster 3: For younger families that are starting out, financial priorities lie in family planning, home mortgages and saving for their children's education. The insurer can consider cross-selling home mortgage insurance for their new homes. In addition, given their relatively younger age, term life insurance can be marketed to them ahead of whole life insurance, since its cheaper premiums make it more attractive. Critical illness and health insurance plans could be considered as well, given that younger policyholders are less likely to have pre-existing medical conditions. Targeting marketing campaigns at these areas could increase insurance sign-up rates.

- K-medoids clustering

Using the results of K-medoids clustering, I will take a look at the pricing strategies that can be implemented for the different clusters. These strategies are recommended in conjunction with the existing premiums on the different clusters (refer to Appendix A below).

Cluster 1: Policyholders in this group tend to have high BMIs, which suggests that they might be more prone to obesity-related diseases. This might prompt the insurer to re-evaluate the premiums charged to this group for products such as health and life insurance.

This is already reflected in the existing premiums in Appendix A, where the median of 11,446 is higher than that of both Cluster 2 (8,610) and Cluster 3 (7,347). The insurer should consider whether the percentage difference in premiums is sufficient to cover the risk of insuring the health costs associated with obesity-related diseases.
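
The percentage difference in question can be worked out directly from the three medians quoted above. The short sketch below does this arithmetic; the variable names are illustrative.

```python
# Median existing premiums per K-medoids cluster, as quoted from Appendix A.
medians = {"Cluster 1": 11446, "Cluster 2": 8610, "Cluster 3": 7347}

base = medians["Cluster 1"]
# Percentage by which the Cluster 1 median exceeds each other cluster's median.
uplift = {
    name: (base - value) / value * 100
    for name, value in medians.items()
    if name != "Cluster 1"
}

for name, pct in uplift.items():
    print(f"Cluster 1 median is {pct:.1f}% above {name}")
# Cluster 1 median is 32.9% above Cluster 2
# Cluster 1 median is 55.8% above Cluster 3
```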

Cluster 2: This cluster is made up mostly of males.

Cluster 3: This cluster is made up mostly of females.

Policyholders in Cluster 2 and Cluster 3 have a lower and healthier BMI than those in Cluster 1. The insurer could therefore consider offering them a premium rebate to retain existing policyholders and attract new sign-ups, given that the insurer takes on a lower risk in insuring these clusters.

As for the difference in sex between Cluster 2 (males) and Cluster 3 (females), females tend to have higher health care costs because of maternity care and longer life expectancy. However, Appendix A shows that Cluster 2 (males) has a higher median premium than Cluster 3 (females). The insurer might want to raise the premiums for Cluster 3 to match those of Cluster 2, given the higher likelihood of health insurance payouts involved in insuring females.

Comparison of K-means and K-medoids clustering

Limitations

Future Research

Appendix A - Descriptive Statistics on the Current Premiums (Clustered)
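
Descriptive statistics of this kind can be produced by grouping the current premiums by cluster label. The sketch below uses a small hypothetical table; the charges shown are made-up values for illustration and are not the figures from the analysis.

```python
import pandas as pd

# Hypothetical cluster labels and charges, for illustration only.
df = pd.DataFrame({
    "Clusters": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "charges": [9000.0, 11000.0, 14000.0,
                7000.0, 8500.0, 9500.0,
                6000.0, 7500.0, 8000.0],
})

# count, mean, std, min, quartiles and max of charges within each cluster.
summary = df.groupby("Clusters")["charges"].describe()
print(summary[["count", "mean", "50%", "min", "max"]])
```

The `50%` column is the median, the statistic used in the pricing discussion above.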